Day 5. Data Preprocessing [R]

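2022 iThome Ironman Contest

DAY 5

AI & Data

Notes on Machine Learning and Data Visualization [R, Python], part 5 of the series

Tags: 14th Ironman Contest, r, data preprocessing

yylee12

Team: NTUEPM_STAT LIFE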

2022-09-16 06:42:29


Data preprocessing

Filtering data
Merging data
Splitting data
Other data cleanup

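When analyzing data, the most time-consuming step is actually the preprocessing; it eats up more than half of our time~~
Today I'm collecting some code that comes in handy when tidying up data.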

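Filtering data

Filtering data in a data.frame

Code	Description
`dataFrame[row index,column index]`	Select data by index position
`dataFrame["row name","column name"]` `dataFrame[,"col_name"]`	Select data by row or column name
`subset()` `subset(dataFrame, condition, select=c(col_name))`	Keep the rows that satisfy a logical condition; `select` optionally picks which columns to return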

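Taking the iris data as an example:
if I want to know which species (Species) the flowers with Sepal.Length greater than 5 belong to, the following two approaches give the same result:

data(iris)
iris$Species[iris$Sepal.Length > 5]
subset(iris, Sepal.Length > 5)$Species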
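The three selection styles from the table can be tried side by side on iris. This is only a small sketch; the variable names (`by_index`, `sepal_len`, `long_sepal`, `same`) are made up for illustration.

```r
data(iris)

# By position: rows 1 to 3, columns 1 and 5
by_index <- iris[1:3, c(1, 5)]

# By name: a single column comes back as a plain vector
sepal_len <- iris[, "Sepal.Length"]

# subset(): rows where Sepal.Length > 5, keeping only two columns
long_sepal <- subset(iris, Sepal.Length > 5,
                     select = c(Sepal.Length, Species))

# The two filtering approaches from the text return identical values
same <- identical(iris$Species[iris$Sepal.Length > 5],
                  subset(iris, Sepal.Length > 5)$Species)

dim(by_index)
head(long_sepal)
same
```

Inside `subset()` the column can be written without the `iris$` prefix, because the condition is evaluated within the data.frame.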

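Merging data

Code	Description
`rbind(x, y)`	Combine data by row (stack the rows of y under x)
`cbind(x, y)`	Combine data by column (place the columns of y next to x)
`merge(x, y, by = "col_name")`	Merge on a specific column ("col_name")
`join( )`	Merge while treating one data set as the primary one (dplyr's `inner_join()`, `left_join()`, `right_join()`, `full_join()`)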
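To see what `rbind()` and `cbind()` do, here is a tiny sketch with made-up data frames (`x`, `y`, `z` and their columns are assumptions for illustration only). `rbind()` needs matching column names, and `cbind()` needs the same number of rows.

```r
x <- data.frame(id = 1:2, score = c(90, 85))
y <- data.frame(id = 3:4, score = c(70, 60))
z <- data.frame(grade = c("A", "B"))

# Stack rows: x and y share the same columns
stacked <- rbind(x, y)

# Put columns side by side: x and z have the same number of rows
side_by_side <- cbind(x, z)

stacked
side_by_side
```

Note that `cbind()` pairs rows purely by position; when rows have to be matched by a key, `merge()` or a join is the safer choice.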

merge( )

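merge(x, y, by = "col_name") # merge two data.frames on a specific column ("col_name"); by default only rows whose key appears in both are kept
merge(x, y, by = "col_name", all = TRUE) # all = TRUE keeps every row from both x and y (missing cells become NA)
merge(x, y, by = "col_name", all.x = TRUE) # keep every row of x (the first data set), plus the matching values from y
merge(x, y, by = "col_name", all.y = TRUE) # keep every row of y (the second data set), plus the matching values from x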
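The effect of `all`, `all.x` and `all.y` is easiest to see on toy data. The data frames below are made up for illustration (ids 1 to 3 in `x`, ids 2 to 4 in `y`).

```r
x <- data.frame(id = c(1, 2, 3), height = c(170, 165, 180))
y <- data.frame(id = c(2, 3, 4), weight = c(60, 75, 68))

inner <- merge(x, y, by = "id")                # ids 2, 3
full  <- merge(x, y, by = "id", all = TRUE)    # ids 1, 2, 3, 4
left  <- merge(x, y, by = "id", all.x = TRUE)  # ids 1, 2, 3 (weight is NA for id 1)
right <- merge(x, y, by = "id", all.y = TRUE)  # ids 2, 3, 4 (height is NA for id 4)

inner
full
left
right
```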

join( )

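install.packages("dplyr") # only needed once
library(dplyr) # load the dplyr package
inner_join(x, y, by = "col_name" ) # intersection: keep only the rows that match in both
left_join(x, y, by = "col_name" ) # keep every row of the first data set x
right_join(x, y, by = "col_name" ) # keep every row of the second data set y
full_join(x, y, by = "col_name" ) # keep all rows from both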
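As a quick check of the four joins, the sketch below counts the rows each one returns on made-up data (`x`, `y` and `join_rows` are illustrative names). It is wrapped in `requireNamespace()` so it simply does nothing when dplyr is not installed.

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  x <- data.frame(id = c(1, 2, 3), height = c(170, 165, 180))
  y <- data.frame(id = c(2, 3, 4), weight = c(60, 75, 68))

  # Number of rows kept by each join type
  join_rows <- c(
    inner = nrow(dplyr::inner_join(x, y, by = "id")),
    left  = nrow(dplyr::left_join(x, y, by = "id")),
    right = nrow(dplyr::right_join(x, y, by = "id")),
    full  = nrow(dplyr::full_join(x, y, by = "id"))
  )
  print(join_rows)

  # Rows of x without a match in y get NA in the columns coming from y
  print(dplyr::left_join(x, y, by = "id"))
}
```

In terms of which rows are kept, these correspond to `merge()` with no extra argument, `all.x = TRUE`, `all.y = TRUE` and `all = TRUE`.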

Splitting data

Suppose I want to split the data into two data sets (training and testing data).
Here are two ways to do it:

Method 1, using base R's sample() (the ISLR package is not actually needed for this):

data <- iris # replace with your own data

smp_siz <- floor(0.8 * nrow(data))  # number of rows to sample; 0.8 can be changed to another ratio
smp_siz # sample size

set.seed(1) # fix the random seed so the split is reproducible

train_ind <- sample(seq_len(nrow(data)), size = smp_siz) # sample row indices: sample(all row indices, how many to draw)
training_data <- data[train_ind, ] # the rows whose index was drawn
testing_data <- data[-train_ind, ] # every row not in training_data

## check the row counts
nrow(data)
nrow(training_data)
nrow(testing_data)
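Method 2, using the caTools package:

data <- iris # replace with your own data

install.packages("caTools") # only needed once
library(caTools)  # load the caTools package

set.seed(1) # fix the random seed so the split is reproducible
# sample.split() expects the label vector (one value per row), not the whole data.frame.
# It returns one TRUE/FALSE per row according to SplitRatio; 0.8 can be changed to another ratio.
split_flag <- sample.split(data$Species, SplitRatio = 0.8)
training_data <- subset(data, split_flag == TRUE)
testing_data <- subset(data, split_flag == FALSE)

## check the row counts
nrow(data)
nrow(training_data)
nrow(testing_data)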
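A split that keeps the class proportions (a stratified split) can also be sketched in base R, by sampling row indices within each Species separately. This is only an illustration of the idea, not how caTools implements it; `idx_by_class` is a made-up name.

```r
set.seed(1)
data <- iris

# Group the row indices by class, then draw 80% from each group
idx_by_class <- split(seq_len(nrow(data)), data$Species)
train_ind <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}), use.names = FALSE)

training_data <- data[train_ind, ]
testing_data <- data[-train_ind, ]

table(training_data$Species)  # 40 rows per species
table(testing_data$Species)   # 10 rows per species
```

With a plain random split the class counts usually come out slightly uneven; sampling within each class keeps them fixed.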

Other data cleanup

class( )
colnames( )
recode( )
as.numeric( ), as.factor( )
sort( )
ifelse( )

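class(variable)
Returns the type (class) of a variable.
colnames(data) <- c("name_col1","name_col2",...)
Renames the columns.
recode( )
Renames or re-encodes values. The string syntax below is the one used by recode() in the car package; dplyr also has a recode() that takes different arguments, so write car::recode() when both packages are loaded.

car::recode( data$column, " 'old_name1'='new1' ; 'old_name2'='new2' ")

as.numeric( ), as.factor( )
Convert the type to numeric with as.numeric( ) or to a categorical factor with as.factor( ).
sort( )
sort( ) sorts from smallest to largest; sort( , decreasing = TRUE) sorts from largest to smallest.
ifelse(condition on data$col, A, B)
Rows that satisfy the condition are given the value A; all other rows are given the value B.
Demonstration with the iris data: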

data <- iris
head(data)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

If Sepal.Length is greater than 5, the new column Sep.Length.5 is given the value 1.
In all other cases Sep.Length.5 is given the value 2.

data$Sep.Length.5 <- ifelse(data$Sepal.Length > 5, 1, 2)

> head(data)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sep.Length.5
1          5.1         3.5          1.4         0.2  setosa            1
2          4.9         3.0          1.4         0.2  setosa            2
3          4.7         3.2          1.3         0.2  setosa            2
4          4.6         3.1          1.5         0.2  setosa            2
5          5.0         3.6          1.4         0.2  setosa            2
6          5.4         3.9          1.7         0.4  setosa            1

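Check whether Species is "setosa": if it is, the new column col_setosa gets the value "Is_Setosa".
Rows that are not "setosa" get the value "Not_setosa".

data$col_setosa <- ifelse(data$Species %in% c("setosa"), "Is_Setosa", "Not_setosa")

%in%: for each element on the left, tests whether it appears in the vector on the right, and returns TRUE or FALSE.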

> head(data)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species col_setosa
1          5.1         3.5          1.4         0.2  setosa  Is_Setosa
2          4.9         3.0          1.4         0.2  setosa  Is_Setosa
3          4.7         3.2          1.3         0.2  setosa  Is_Setosa
4          4.6         3.1          1.5         0.2  setosa  Is_Setosa
5          5.0         3.6          1.4         0.2  setosa  Is_Setosa
6          5.4         3.9          1.7         0.4  setosa  Is_Setosa
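The other helpers listed in this section can be tried the same way on a tiny made-up data.frame. Everything below is base R; `df`, `lookup` and the other names are assumptions for illustration. The named-vector lookup at the end is a base-R alternative to `recode()`, not the `car` function itself.

```r
df <- data.frame(a = c("3", "1", "2"), b = c("low", "high", "low"),
                 stringsAsFactors = FALSE)

class(df$a)                          # "character"
colnames(df) <- c("value", "level")  # rename the columns

df$value <- as.numeric(df$value)     # character -> numeric
df$level <- as.factor(df$level)      # character -> factor

sorted_desc <- sort(df$value, decreasing = TRUE)  # 3 2 1

# %in% returns one TRUE/FALSE per element on the left
is_low <- df$level %in% c("low")     # TRUE FALSE TRUE

# Pitfall: as.numeric() on a factor returns the internal level codes
f <- factor(c("10", "20", "10"))
codes <- as.numeric(f)                  # 1 2 1
values <- as.numeric(as.character(f))   # 10 20 10

# Recoding with a named lookup vector
lookup <- c(low = "L", high = "H")
short_level <- unname(lookup[as.character(df$level)])  # "L" "H" "L"
```

Going through `as.character()` first is the usual way to get the original numbers back out of a factor.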

[Supplement] The dplyr package also contains many handy functions for data cleaning.